IBM6250 Group Project - Coffee Vending

Author

Group 1

Order  Task / Deliverable                Owner   Details
1      Input / Intro section             Jarrod  Project overview and report framing
2      Exploratory Data Analysis (EDA)   Min     Data cleaning, visualization, descriptive stats
3      Modeling                          Eunice  Build and validate predictive models
4      Forecast                          Ceren   Predict overall revenue, drink consumption, and ingredient usage

Introduction

Effective inventory control for coffee-vending machines hinges on anticipating weekly ingredient consumption while avoiding costly spoilage. We forecast demand using historical sales from two machines, delivering eight-week projections that guide stock levels and reorder cadence.

This report:

  1. Imports & cleans transaction data from two coffee-vending machines.
  2. Explores key demand drivers.
  3. Models weekly sales with Seasonal ARIMA (plus Prophet as a benchmark).
  4. Delivers eight-week forecasts and stocking recommendations.

Data Input and Combining

The Kaggle data comes from two vending machines. Below we import the transaction data and combine it into a single table.

Raw Transaction Data

Transaction data was taken from the following Kaggle link:

https://www.kaggle.com/datasets/ihelon/coffee-sales

# Import machine 1 transactions and tag the source machine
machine1 <- read_csv("data/index_1.csv") |>
  mutate(machine_id = "machine1")

# Parse dates/timestamps and standardize drink names to match the recipe table
sales <- machine1 |>
  mutate(date = as_date(date),
         datetime = as_datetime(datetime),
         coffee_name = toupper(coffee_name))
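The section title mentions combining two machines, but only machine1 is read above. A hedged sketch of importing and stacking a second machine's file, assuming the Kaggle download also provides data/index_2.csv (the file name is an assumption):

```r
library(readr)
library(dplyr)

# Assumed second export from the same Kaggle dataset
machine2 <- read_csv("data/index_2.csv") |>
  mutate(machine_id = "machine2")

# Stack the two transaction tables; machine_id keeps the sources distinguishable
sales_raw <- bind_rows(machine1, machine2)
```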


skim(sales)
Data summary
Name sales
Number of rows 3636
Number of columns 7
_______________________
Column type frequency:
character 4
Date 1
numeric 1
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
cash_type 0 1.00 4 4 0 2 0
card 89 0.98 19 19 0 1316 0
coffee_name 0 1.00 5 19 0 8 0
machine_id 0 1.00 8 8 0 1 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2024-03-01 2025-03-23 2024-10-06 381

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
money 0 1 31.75 4.92 18.12 27.92 32.82 35.76 40 ▁▃▅▃▇

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
datetime 0 1 2024-03-01 10:15:50 2025-03-23 18:11:38 2024-10-07 02:55:12 3636
unique(machine1$coffee_name)
[1] "Latte"               "Hot Chocolate"       "Americano"          
[4] "Americano with Milk" "Cocoa"               "Cortado"            
[7] "Espresso"            "Cappuccino"         

Products and Ingredients

In the dataset, only product names are given. In order to more accurately predict what ingredients are needed and when, we must decompose the product into its ingredients. See below for the assumptions made for each of the 8 unique products.

Unique Products

unique(machine1$coffee_name) %>% sort()
[1] "Americano"           "Americano with Milk" "Cappuccino"         
[4] "Cocoa"               "Cortado"             "Espresso"           
[7] "Hot Chocolate"       "Latte"              

Ingredients

recipes <- tribble(
  ~coffee_name,            ~coffeeG, ~milkML, ~chocolateG,  ~sugarG, ~vanillaML,
  "AMERICANO",                18,       0,          0,         0,        0,
  "AMERICANO WITH MILK",      18,      60,          0,         0,        0,
  "CAPPUCCINO",               18,     100,          0,         0,        0,
  "COCOA",                     0,     240,         22,        15,        0,
  "CORTADO",                  18,      60,          0,         0,        0,
  "ESPRESSO",                 18,       0,          0,         0,        0,
  "HOT CHOCOLATE",             0,     240,         30,        20,        0,
  "LATTE",                    18,     240,          0,         0,       10
)
Ingredient Logic

Drink                Ingredient-logic rationale
Espresso             Straight double shot: 18 g ground coffee, no additives (Specialty Coffee Association 2018).
Americano            Same 18 g espresso diluted with ≈4× its volume of hot water; nothing else required (Specialty Coffee Association 2018).
Americano with Milk  Americano softened with ≈60 ml steamed milk, enough to mellow bitterness without turning it into a latte (Cordell 2024).
Cappuccino           Classic 1:1:1 build of espresso, ≈60 ml steamed milk, and equal micro-foam, filling a 150–180 ml cup (Raffii 2024).
Cortado              Spanish "cut" drink: equal parts double espresso and ≈60 ml steamed milk (Wine Editors 2025).
Latte                U.S. latte stretches the shot with ≈240 ml milk (1:4–5 ratio); the vanilla version adds 10 ml syrup (≈2 pumps) (Coffee Bros. 2024; Page 2025).
Cocoa                Non-coffee mix: 240 ml milk + 22 g cocoa powder + 15 g sugar, standard stovetop proportions (Hersheyland Test Kitchen 2025).
Hot Chocolate        Richer café blend: same milk but 30 g cocoa and 20 g sugar for a modern sweetness level (Hersheyland Test Kitchen 2025).

Combining Transaction Data and Recipes

Below we join the two tables on the coffee name, which attaches ingredient quantities to every row of the transaction data. The result is the dataset used throughout the analysis.

sales_ingredients <- sales |>
  left_join(recipes, by = "coffee_name") |>
  # Any drink without a recipe match would yield NAs; treat those as zero usage
  replace_na(list(
    coffeeG = 0, milkML = 0, chocolateG = 0,
    sugarG = 0, vanillaML = 0
  ))
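A quick sanity check on the join (a sketch, assuming dplyr is loaded): any transaction whose coffee_name failed to match a recipe would surface here as an unmatched name.

```r
# Drinks present in sales but absent from the recipe table (should return zero rows)
sales |>
  anti_join(recipes, by = "coffee_name") |>
  distinct(coffee_name)
```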

Converting to Weekly Series

We aggregate to a weekly time series because the business decisions we are informing, such as re-ordering coffee, milk, and chocolate, are made on a weekly cadence. Collapsing daily transactions into weeks smooths out erratic day-to-day swings, leaving a cleaner signal that aligns directly with the quantity we must predict.

We also convert the data to a time-series object (tsibble) and verify the series has no gaps. If .gaps shows FALSE, there are no gaps.

weekly_sales <- sales_ingredients |>
  mutate(week = lubridate::floor_date(date, unit = "week")) |>
  group_by(week) |>
  # Sum each ingredient column and count cups sold per week
  summarise(across(coffeeG:vanillaML, \(x) sum(x, na.rm = TRUE)),
            sales_n = n()) |>
  ungroup()

weekly_sales <- weekly_sales|>
  as_tsibble(index = week)

has_gaps(weekly_sales)
# A tibble: 1 × 1
  .gaps
  <lgl>
1 FALSE
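Had has_gaps() returned TRUE, the missing weeks could be padded before modeling; a minimal sketch using tsibble's fill_gaps():

```r
# Pad any implicit gaps with explicit rows, treating a missing week as zero cups sold
weekly_sales |>
  fill_gaps(sales_n = 0L)
```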

Final Data for use in Analysis

Weekly ingredient demand vs. cups sold

Exploratory Data Analysis (EDA)

Total Revenue Series

#weekly revenue tsibble
weekly_revenue <- sales %>%
  mutate(week = floor_date(date, unit = "week")) %>%
  group_by(week) %>%
  summarise(total_revenue = sum(money, na.rm = TRUE), .groups = "drop") %>%
  as_tsibble(index = week)
# weekly total revenue time series plot
autoplot(weekly_revenue, total_revenue) +
  labs(title = "Weekly Total Revenue",
       y = "Revenue ($)",
       x = "Week") +
  theme_minimal()

# Interactive weekly revenue
p <- weekly_revenue %>%
  ggplot(aes(x = week, y = total_revenue)) +
  geom_line(color = "steelblue", linewidth = 1.2) +
  labs(title = "Interactive Weekly Revenue",
       y = "Revenue ($)", x = NULL) +
  theme_minimal()

ggplotly(p)
#STL Decomposition
decomposed_revenue <- weekly_revenue %>%
  model(STL(total_revenue ~ season() + trend(), robust = TRUE)) %>%
  components()

# plot component
autoplot(decomposed_revenue) +
  labs(title = "STL Decomposition of Weekly Revenue",
       x = "Week", y = NULL) +
  theme_minimal()

#Differencing
diff_revenue <- weekly_revenue %>%
  mutate(diff_revenue = difference(total_revenue, lag = 1))  

autoplot(diff_revenue, diff_revenue) +
  labs(title = "First Difference of Weekly Revenue",
       y = "Differenced Revenue",
       x = "Week") +
  theme_minimal()

# ACF
diff_revenue %>%
  ACF(diff_revenue, lag_max = 30) %>%
  autoplot() +
  labs(title = "ACF of Differenced Revenue")

# Summary stats
summary_stats <- weekly_revenue %>%
  summarise(
    mean_revenue = mean(total_revenue, na.rm = TRUE),
    median_revenue = median(total_revenue, na.rm = TRUE),
    max_revenue = max(total_revenue, na.rm = TRUE),
    min_revenue = min(total_revenue, na.rm = TRUE)
  )

# Highest Weekly Sales
top_week <- weekly_revenue %>%
  filter(total_revenue == max(total_revenue, na.rm = TRUE))
top_week
# A tsibble: 1 x 2 [7D]
  week       total_revenue
  <date>             <dbl>
1 2024-10-06         3546.
datatable(summary_stats, caption = "Weekly Revenue Summary Stats")

Data Description & Exploratory Summary

  • Dataset Overview

This dataset is derived from the raw sales table and represents total weekly revenue, aggregated across all transactions from the coffee vending machine. It covers March 2024 through March 2025 and spans 8 coffee products.

  • EDA Findings
  1. The weekly time series is complete, with no missing weeks.

  2. The revenue shows a distinct trend composed of three major peaks—a pattern resembling “three hills”.

  3. STL decomposition confirmed a clear trend but no significant seasonality was detected.

  4. After first-order differencing, the series became stationary, with the ACF showing white noise.

  • Key Observations
  1. Highest revenue: Week of October 6, 2024 — $3,546

  2. Lowest revenue: Week of March 23, 2025 — $204.76

  3. The fluctuation pattern suggests possible external drivers (e.g., promotions or demand cycles).

  • Assumptions
  1. Revenue can be modeled primarily as a trend-driven process.

  2. The data-generating process in the next 8 weeks will be similar to the recent 8 weeks.

  3. Weekly sales will continue to grow in the near term.
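The stationarity claims above can be checked formally rather than by eye; a sketch using unit-root features from the feasts package:

```r
library(feasts)

# KPSS test on the level series, plus the number of first differences it suggests
weekly_revenue %>%
  features(total_revenue, list(unitroot_kpss, unitroot_ndiffs))
```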

Coffee Type Series

#sales of each coffee type
weekly_coffee_volume <- sales %>%
  mutate(week = floor_date(date, unit = "week")) %>%
  group_by(week, coffee_name) %>%
  summarise(
    total_volume = n(),  
    .groups = "drop"
  ) %>%
  as_tsibble(index = week, key = coffee_name)
# Weekly cups sold by coffee type
weekly_coffee_volume %>%
  autoplot(total_volume) +
  facet_wrap(~coffee_name, scales = "free_y") +
  labs(title = "Weekly Cups Sold by Coffee Type",
       y = "Number of Cups", x = "") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

# Check NA
missing_data <- weekly_coffee_volume %>%
  group_by(coffee_name) %>%
  summarise(missing_weeks = sum(is.na(total_volume)))

#fill gap
weekly_volume_filled <- weekly_coffee_volume %>%
  fill_gaps(total_volume = 0)

decomposed_volume <- weekly_volume_filled %>%
  model(
    STL(total_volume ~ season(window = 52) + trend(window = 7), robust = TRUE)
  ) %>%
  components()

autoplot(decomposed_volume) +
  facet_wrap(~coffee_name, scales = "free_y") +
  labs(title = "STL Decomposition of Weekly Volume by Coffee Type",
       x = "Week", y = NULL) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

#ACF
weekly_volume_filled %>%
  group_by(coffee_name) %>%
  group_walk(~{
    v <- .x$total_volume
    v <- v[is.finite(v)]

    if (length(v) < 2L || var(v) == 0) {
      message("Skipping ", .y$coffee_name, ": not enough variation.")
    } else {
      acf(v,
          main = paste("ACF for", .y$coffee_name),
          na.action = na.pass)
    }
  })

#Differencing
weekly_volume_diff <- weekly_volume_filled %>%
  group_by(coffee_name) %>%
  mutate(diff_volume = difference(total_volume)) %>%
  ungroup()
#check ACF
weekly_volume_diff %>%
  filter(!is.na(diff_volume)) %>%
  group_by(coffee_name) %>%
  group_walk(~{
    v <- .x$diff_volume
    v <- v[is.finite(v)]

    if (length(v) < 2L || var(v) == 0) {
      message("Skipping ", .y$coffee_name, ": not enough variation.")
    } else {
      acf(v,
          main = paste("ACF after differencing:", .y$coffee_name),
          na.action = na.pass)
    }
  })

weekly_volume_diff_2 <- weekly_volume_filled %>%
  group_by(coffee_name) %>%
  mutate(diff_volume_2 = difference(total_volume, lag = 2)) %>%
  ungroup()


weekly_volume_diff_2 %>%
  filter(!is.na(diff_volume_2)) %>%
  group_by(coffee_name) %>%
  group_walk(~{
    v <- .x$diff_volume_2
    v <- v[is.finite(v)]

    if (length(v) < 2L || var(v) == 0) {
      message("Skipping ", .y$coffee_name, ": not enough variation.")
    } else {
      acf(v,
          main = paste("ACF after second differencing:", .y$coffee_name),
          na.action = na.pass)
    }
  })

# Rank coffee types by average weekly volume sold
weekly_coffee_volume %>%
  group_by(coffee_name) %>%
  summarise(avg_volume = mean(total_volume, na.rm = TRUE)) %>%
  arrange(desc(avg_volume)) %>%
  DT::datatable(caption = "Average Weekly Sales Volume by Coffee Type")

Data Description & Exploratory Summary

  • Dataset Overview

This dataset represents the total weekly sales volume (cups sold) for 8 different coffee types from March 2024 to March 2025.

  • EDA Findings
  1. The dataset contains NA values for some weeks and coffee types but is otherwise complete.

  2. After first-order differencing, most coffee types became stationary, except for Americano with Milk and Latte, which still showed a trend.

  3. No significant seasonality was detected in the sales volumes after STL decomposition.

  • Key Observations
  1. Highest Sales Volume: Americano with Milk peaked in the week of February 16, 2025.

  2. Average Weekly Volume: The average sales volume for all coffee types was around 42 cups per week.

  • Assumptions
  1. Americano with Milk will remain the top-selling coffee over the next 8 weeks, continuing its strong performance observed historically.

  2. Customer preferences may shift across coffee types due to weather.

  3. Growth patterns, if any, will be gradual, not abrupt—no coffee type is expected to suddenly double or halve in volume.
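The per-type differencing conclusions can be cross-checked in a single call; a sketch, again leaning on feasts (the feature runs once per key, i.e. per coffee type):

```r
library(feasts)

# Suggested number of first differences for each coffee type's weekly series
weekly_volume_filled %>%
  features(total_volume, unitroot_ndiffs)
```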

Ingredient Series
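The object weekly_long used below is not constructed earlier in the report; a plausible definition, assuming it is simply weekly_sales pivoted into a long tsibble keyed by metric:

```r
library(tidyr)
library(tsibble)

# Long format: one row per (week, metric), where metric is an ingredient column or sales_n
weekly_long <- weekly_sales |>
  as_tibble() |>
  pivot_longer(cols = -week, names_to = "metric", values_to = "value") |>
  as_tsibble(index = week, key = metric)
```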

# Weekly Ingredient Usage visualization
weekly_long %>%
  filter(metric != "sales_n") %>% 
  autoplot(value) +
  facet_wrap(~metric, scales = "free_y") +
  labs(title = "Weekly Ingredient Usage",
       y = "Amount Used",
       x = "") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

#STL decomposition
decomp_recipe <- weekly_long %>%
  filter(metric != "sales_n") %>%
  model(
    STL(value ~ trend(window = 7), robust = TRUE)
  ) %>%
  components()

autoplot(decomp_recipe) +
  facet_wrap(~metric, scales = "free_y") +
  labs(title = "STL Decomposition of Ingredient Usage") +
  theme_minimal() +
  theme(legend.position = "none")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#differencing
diff_recipe <- weekly_long %>%
  filter(metric != "sales_n") %>%
  mutate(diff_value = difference(value))

diff_recipe %>%
  autoplot(diff_value) +
  facet_wrap(~metric, scales = "free_y") +
  labs(title = "First-Order Differenced Ingredient Usage") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  theme(legend.position = "none")

#ACF 
diff_recipe %>%
  filter(!is.na(diff_value)) %>%
  ACF(diff_value) %>%
  autoplot() +
  facet_wrap(~metric, scales = "free_y") +
  labs(title = "ACF of First-Order Differenced Ingredient Usage") +
  theme_minimal()

ingredient_summary <- weekly_long %>%
  filter(metric != "sales_n") %>%
  as_tibble() %>%   # drop the weekly index so summarise returns one row per ingredient
  group_by(metric) %>%
  summarise(
    avg_weekly_usage = mean(value, na.rm = TRUE),
    min_weekly_usage = min(value, na.rm = TRUE),
    max_weekly_usage = max(value, na.rm = TRUE),
    total_usage = sum(value, na.rm = TRUE),
    weeks_count = n()
  ) %>%
  arrange(desc(avg_weekly_usage))

ingredient_summary

Data Description & Exploratory Summary

  • Dataset Overview

This dataset represents weekly ingredient usage for the coffee vending machine from March 2024 to March 2025, covering the key ingredients milk, coffee, chocolate, sugar, and vanilla.

  • EDA Findings
  1. The usage of milk exhibits the highest variation and fluctuation among all ingredients, especially when analyzed via STL decomposition.

  2. Other ingredients, such as coffee and chocolate, show more stationary patterns in their weekly usage.

  3. After first-order differencing, all time series (except milk) become stationary, with ACF plots showing no significant autocorrelation, indicating random fluctuations.

  • Key Observations
  1. Milk, the ingredient with the highest demand, shows significant volatility.

  2. Coffee shows a slightly fluctuating trend, being the second highest in demand after milk.

  • Assumptions
  1. Given that milk consistently shows the highest demand, it is assumed that milk will continue to be the ingredient with the highest usage in the next 8 weeks.

  2. Ingredients like coffee and chocolate will likely maintain stable usage, though minor fluctuations may occur based on consumer trends or events.

Modeling Process

We model each weekly ingredient time series using Seasonal ARIMA (SARIMA), selected for its ability to capture both the autoregressive and seasonal dynamics present in our exploratory data analysis. Our goal is stable 8-week forecasts to inform weekly inventory restocking decisions. Prophet is used as a secondary benchmark model to compare predictive accuracy.

# Reshaping to time series per ingredient
ingredients <- c("coffeeG", "milkML", "chocolateG", "sugarG", "vanillaML")

ingredient_ts <- weekly_sales %>%
  select(week, coffeeG, milkML, chocolateG, sugarG, vanillaML) %>%
  pivot_longer(-week, names_to = "ingredient", values_to = "amount") %>%
  as_tsibble(index = week, key = ingredient)
# Long format tsibble
weekly_long_ingredients <- weekly_sales %>%
  pivot_longer(cols = coffeeG:vanillaML, names_to = "ingredient", values_to = "value") %>%
  as_tsibble(index = week, key = ingredient)

# Fitting ARIMA to each ingredient using key grouping
ingredient_models <- weekly_long_ingredients %>%
  model(ARIMA(value))
# Fitting Seasonal ARIMA (SARIMA) to each ingredient series
arima_models <- ingredient_ts %>%
  model(ARIMA_Model = ARIMA(amount ~ season()))
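Out-of-sample accuracy of these fits can be checked on a holdout before trusting the forecasts; a sketch using fable, holding out the final 8 weeks (the split length mirrors the forecast horizon and is an assumption):

```r
library(fable)
library(dplyr)

# Hold out the last 8 weeks of each ingredient series as a test set
train <- weekly_long_ingredients %>%
  filter(week <= max(week) - lubridate::weeks(8))

# Fit on the training window, forecast the holdout, and score against actuals
train %>%
  model(ARIMA(value)) %>%
  forecast(h = "8 weeks") %>%
  accuracy(weekly_long_ingredients)
```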
#install.packages("prophet")
library(prophet)

# Converting ARIMA input to Prophet format for each ingredient
prophet_data <- ingredient_ts %>%
  as_tibble() %>%
  rename(ds = week, y = amount) %>%
  split(.$ingredient)

# Function to fit Prophet and generate 8-week forecast
fit_prophet <- function(df) {
  m <- prophet(df[, c("ds", "y")], weekly.seasonality = TRUE, yearly.seasonality = FALSE, daily.seasonality = FALSE)
  future <- make_future_dataframe(m, periods = 8, freq = "week")
  forecast <- predict(m, future)
  forecast$ingredient <- unique(df$ingredient)
  forecast
}

# Applying Prophet to all ingredients
prophet_forecasts <- map_df(prophet_data, fit_prophet)

# Actual + Forecasted values for plotting
prophet_combined <- ingredient_ts %>%
  as_tibble() %>%
  rename(ds = week, y = amount) %>%
  bind_rows(
    prophet_forecasts %>%
      select(ds, yhat, ingredient) %>%
      rename(y = yhat)
  )

# Plot Prophet forecasts for all ingredients
prophet_combined %>%
  ggplot(aes(x = ds, y = y, color = ingredient)) +
  geom_line() +
  facet_wrap(~ingredient, scales = "free_y") +
  labs(title = "8-Week Prophet Forecasts per Ingredient", x = "Week", y = "Predicted Amount") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

To benchmark the SARIMA models, we applied Prophet to all five ingredients. Prophet captures trend and seasonality non-parametrically, which is useful in business forecasting where abrupt changes or flexible curves are expected, and we generated matching 8-week forecasts.

Model Evaluation and Diagnostics

For each ingredient, we examined residual plots from SARIMA to assess randomness, autocorrelation, and normality.
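For a single ingredient, the residual time plot, ACF, and histogram can be drawn in one call; a sketch for milk using feasts's gg_tsresiduals():

```r
library(feasts)

# Residual diagnostics for the milk SARIMA model
arima_models %>%
  filter(ingredient == "milkML") %>%
  gg_tsresiduals()
```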

# Perform Ljung-Box test on SARIMA residuals
arima_models %>%
  augment() %>%
  features(.resid, ljung_box, lag = 8) %>%
  datatable(caption = "Ljung-Box Test Results for ARIMA Residuals")

Generate Forecasts

forecast_ingredients <- ingredient_models %>%
  forecast(h = "8 weeks")

autoplot(forecast_ingredients, weekly_long_ingredients) +
   labs(title = "8-Week Ingredient Forecasts (Seasonal ARIMA)",
       y = "Amount", x = "Week") +
  facet_wrap(~ingredient, scales = "free_y") +
  theme_minimal()
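To turn these forecasts into order quantities, the upper bound of a prediction interval offers a simple safety-stock rule; a sketch (the 95% level and the rounding rule are illustrative assumptions, not report policy):

```r
# Weekly order quantity per ingredient: upper end of the 95% forecast interval
forecast_ingredients %>%
  hilo(level = 95) %>%
  as_tibble() %>%
  mutate(order_qty = ceiling(`95%`$upper)) %>%
  select(ingredient, week, .mean, order_qty)
```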

Executive Summary with Actionable Recommendations

Appendix

Machine 1

Machine 1 Weekly ingredient demand vs. cups sold

References

Coffee Bros. 2024. “Milk-to-Espresso Ratio Calculator.” 2024. https://coffeebros.com/pages/milk-to-espresso-ratio-calculator.
Cordell, George. 2024. “Americano Coffee with Milk Recipe.” 2024. https://coffeelikers.com/americano-coffee-with-milk/.
Hersheyland Test Kitchen. 2025. “Hot Cocoa for One.” 2025. https://www.hersheyland.com/recipes/hot-cocoa-for-one.html.
Page, Amazon Product. 2025. “Torani Vanilla Syrup Pump – 8 ml Per Pump.” https://www.amazon.com/dp/B09P49HWKK.
Raffii, Ahlam. 2024. “How to Make the Perfect Cappuccino.” 2024. https://www.thespruceeats.com/how-to-make-cappuccinos-766116.
Specialty Coffee Association. 2018. “Defining the Ever-Changing Espresso.” https://sca.coffee/sca-news/25-magazine/issue-3/defining-ever-changing-espresso-25-magazine-issue-3.
Wine Editors, Food &. 2025. “8 Types of Coffee Drinks to Order Around the World.” https://www.foodandwine.com/types-of-coffee-drinks-11724561.